STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.



Abstract 

The Session Initiation Protocol (SIP) can support multi-party
conferencing in many different ways. In this draft, we define the
various multi-party conferencing models, discuss how they are used,
and then analyze their relative benefits and drawbacks.



1 Introduction 

The Session Initiation Protocol (SIP) [1] has been defined for the
establishment, maintenance, and termination of calls between one or
more users. However, despite its origins as a large-scale multiparty
conferencing protocol, SIP is used today primarily for point-to-point
calls. This configuration is the focus of the SIP specification and
most of its extensions. As a result, there is a lot of confusion
about how SIP supports multi-party conferencing.

We seek to remedy this problem by describing, in a consistent and
complete fashion, the various multi-party conferencing models
supported by standard SIP. For each model, we discuss:

o How the model works.

o How users are invited to join.

o How users can join an existing conference without being
invited.

o How well the model scales.

o Which entities need to be aware of the model.

o How participants learn about each other.

We also identify missing pieces and recommend standards activity to
fill them in. This document itself does not define any new extensions
of any kind.

2 End System Mixing



The first model we call "end system mixing". In this model, user A
calls user B, and they have a conversation. At some point later, A
decides to conference in user C. To do this, A calls C, using a
completely separate SIP call. This call uses a different Call-ID,
different tags, etc. There is no call set up directly between B and
C. A receives media streams from both B and C, and mixes them. A
sends a stream containing A's and C's streams to B, and a stream
containing A's and B's streams to C.

This model is depicted graphically in Figure 1.

Basically, user A handles both signaling and media mixing. B and C
are unaware of the multi-party call, from a SIP perspective at least.
From an RTP perspective, A is a mixer, and so the RTCP reports from A 
will contain SDES information that indicates the existence of an 
additional party in the media stream. 

Note that this model has the serious drawback that the conference 
ends when the mixing UA leaves the call. 

Figure 1: Three Way Calling using End System Mixing

2.1 Inviting Users to Join

Any user in the conference can invite another user to join, so long
as they are capable of performing the required mixing and signaling
functions. To invite a new user to join, a user in the conference
simply calls them using normal SIP procedures. The only difference is
that the stream sent to that new user contains the streams received
from the other parties in the call.

In fact, it is perfectly acceptable for complex connectivity graphs
to be constructed, as a result of different users inviting other
users to join. For example, take our case of A calling B, and then
calling C. If, later on, C calls D, C will perform the mixing of
the streams it gets from A (which actually contain media from A and
B), along with its own stream, and send that to D. This results in a
connectivity graph that looks like:

   B ---- A ---- C ---- D

Note, however, that there is a possibility of loops. From here, if D
calls B, and brings that stream into the conference, a loop is
created. This loop can be detected using the mechanisms described in
the RTP specification [2]. However, we expect these conditions to be
extremely rare. Presumably, D knows B is in the conference already,
and so would not likely call B and invite them in.
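
The loop condition can be illustrated at the level of the call graph.
The sketch below is not the RTP loop detection mechanism of [2] (which
operates on the media streams themselves); it is only a hypothetical
helper, would_create_loop, that checks whether a new call would join
two users who are already connected through existing calls.

```python
def would_create_loop(calls, new_call):
    """Return True if adding new_call (a pair of users) closes a cycle.

    calls is a list of (user, user) pairs, one per existing SIP call.
    Uses a small union-find structure over the users.
    """
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    # Merge the two endpoints of every existing call.
    for a, b in calls:
        root_a, root_b = find(a), find(b)
        parent[root_a] = root_b

    x, y = new_call
    return find(x) == find(y)


# The example from the text: A-B, A-C, then C-D.
existing = [("A", "B"), ("A", "C"), ("C", "D")]
print(would_create_loop(existing, ("D", "B")))  # True
print(would_create_loop(existing, ("D", "E")))  # False
```

A UA has no such global view in practice, which is why detection falls
to RTP rather than to SIP signaling.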



2.2 Users Joining 



In this model, there is not any explicit conference "identifier" that
can be used to join. This conference model, by its nature, is built
around ad-hoc conferences. However, it is still possible for a user
to join in the following way.

Let's say a new user, E, simply calls B, unaware even that B is in a
conference (E might actually be aware, but the SIP messaging is no
different). B's softphone, recognizing that B is already in a
conference, asks B if E should be brought into the conference right
away. If B clicks "yes", the call from E is answered. The media stream
sent to E contains media from B, along with the media B is already
receiving from A.

If B had instead clicked "no", E can easily be added to the conference
later. No SIP signaling at all is needed to do this. B simply starts
sending the mixed media to E.

2.3 Scalability

A drawback of this model is its scalability. Viewing the conference
from a graph perspective, if the number of edges touching a vertex
(its degree) equals N, the user corresponding to that vertex has to
perform up to N separate media stream encodings. We say "up to", as
it depends on the number of participants who are talking at once. If
only one participant is talking, the non-talking "mixer" endpoints
don't need to do any additional encoding. If everyone is talking, it
is N encodes. Since encoding is generally a complex process, a
typical workstation these days can handle only two or three
simultaneous encodes using a low rate codec like G.723.1. The problem
can be mitigated somewhat by distributing the mixing responsibilities
(making the graph deep rather than wide). However, this requires a
conscious effort of the participants regarding who is to make the
call to add a new user. This is unlikely to happen in practice.

Another limitation to scalability is bandwidth. If the degree of a
vertex is N, the user needs enough bandwidth to send and receive up
to N streams, for a total of 2N. On a 56K modem, using a G.723.1
codec, this limits the degree to two (remember RTP overheads). This
limitation exists even if only one user is talking. In this case, a
mixing host receives the encoded packet stream, and needs to send a
copy to each participant it is connected to.

For these reasons, this conferencing model is ideal for three-way 
conferences (i.e., degrees of two), but doesn't scale up much higher. 
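
The bandwidth limit can be checked with a little arithmetic. The
sketch below makes several assumptions that are not stated in the
text: one codec frame per RTP packet, uncompressed IPv4/UDP/RTP
headers (40 bytes), G.723.1 at 5.3 kbit/s (20-byte frames every
30 ms), and a modem upstream rate of 33.6 kbit/s. The function names
are illustrative only.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no compression


def stream_bps(payload_bytes, frame_ms, header_bytes=IP_UDP_RTP_HEADER_BYTES):
    """Bit rate of one RTP stream carrying one codec frame per packet."""
    return (payload_bytes + header_bytes) * 8 * 1000 / frame_ms


def max_degree(link_bps, payload_bytes, frame_ms):
    """Largest N such that N streams fit in one direction of the link."""
    return int(link_bps // stream_bps(payload_bytes, frame_ms))


# Assumed figures: 20-byte G.723.1 frames every 30 ms, 33.6 kbit/s upstream.
print(stream_bps(20, 30))         # 16000.0
print(max_degree(33600, 20, 30))  # 2
```

Under these assumptions the headers cost twice as much as the codec
payload, which is why the usable degree is so low.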

2.4 Location of Service Logic 

This model does not require any extension to SIP in order to work. It 
does require knowledge of this mechanism within the UA performing the 
mixing. Non-mixing participants do not need to know anything special.

2.5 Discovering Participant Identities 

The identities of other participants in the conference are NOT known
through SIP. Rather, they are learned through RTP. UAs with degrees
greater than one are RTP mixers. As such, they take the RTCP SDES of
the streams they mix, and aggregate them into the RTCP stream sent
out. Since RTCP messages are sent infrequently, there may be a delay
between when a user joins, and when their presence is known to the
other participants.

3 Large-Scale Multicast Conferences

Large-scale multicast conferences were the original motivation for
both the Session Description Protocol (SDP) [3] and SIP. In a
large-scale multicast conference, one or more multicast addresses are
allocated to the conference (more than one may be needed if layered
encodings are in use). Each participant joins those multicast groups,
and sends their media to those groups. Signaling is not sent to the
multicast groups. The sole purpose of the signaling is to inform
participants of which multicast groups to join.

Large-scale multicast conferences are usually pre-arranged, with
specific start and stop times (which is why this information exists
in SDP). Protocols such as the Session Announcement Protocol (SAP)
[4] are used to announce these conferences. However, multicast
conferences do not need to be pre-arranged, so long as a mechanism
exists to dynamically obtain a multicast address. SAP itself was
originally used for this purpose; this has been supplanted by the
malloc architecture [5], still under development.

So, if there are N participants, there will be point-to-point SIP
relationships between pairs of participants. Each participant sends a
single media stream to the group, and receives up to N-1 streams at
any time. Note that the number of streams that a user will receive
depends on who is actually sending at any given time. If the stream
is audio, and silence suppression is utilized, the number of streams
a user will receive at any given time is equal to the number of users
talking at any given time. Even for very large conferences, this is
usually just a small number of users.



3.1 Inviting Users to Join

Inviting users to join is simple. Any user may invite any other user
to join. The SIP INVITE request contains SDP that indicates multicast
addresses for each media line. The SDP in the 200 OK response may
actually be empty. From Section B.3 of RFC2543:

   For multicast, receive and send multicast addresses are the
   same and all parties use the same port numbers to receive
   media data. If the session description provided by the
   caller is acceptable to the callee, the callee can choose
   not to include a session description or MAY echo the
   description in the response.

The called party then joins the multicast groups indicated in the
SDP, using multicast protocols such as IGMP [6]. Note that it is not
even necessary for users to send each other BYE messages when the
conference is over, especially for large-scale, pre-arranged
conferences that have explicit end times indicated in SDP. SDP aside,
a participant can simply leave the conference at any time by leaving
the multicast groups. No SIP signaling is needed to accomplish this.

3.2 Users Joining



Users can join a conference of this type without being invited. All
they need is the multicast addresses, ports, and codecs being used. 
These can be obtained through any number of means, including SAP. SDP 
conference descriptions can even be obtained from web pages, for 
example. 

Once the addresses are obtained, the user simply joins the 
appropriate multicast groups. Note that absolutely no SIP signaling 
is required in this case. 
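
As an illustration of how little is needed, the sketch below pulls
the multicast groups and ports out of an SDP description obtained by
any of the means above. It is a minimal parser written for this
example, not a complete SDP implementation: it assumes literal IP
addresses in the c= lines and ignores everything except c= and m=.
The function name and the sample session are hypothetical.

```python
import ipaddress


def multicast_groups(sdp):
    """Return (media, group, port) for each media line on a multicast address."""
    session_addr = None   # address from a session-level c= line, if any
    current = None        # [media, port, addr] for the m= section being read
    media = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            kind, port = line[2:].split()[:2]
            current = [kind, int(port.split("/")[0]), session_addr]
            media.append(current)
        elif line.startswith("c="):
            # c=<nettype> <addrtype> <address>[/ttl[/count]]
            addr = line[2:].split()[2].split("/")[0]
            if current is None:
                session_addr = addr
            else:
                current[2] = addr
    return [(kind, addr, port) for kind, port, addr in media
            if addr and ipaddress.ip_address(addr).is_multicast]


SAMPLE = """v=0
o=alice 2890844526 2890844526 IN IP4 host.example.com
s=Seminar
c=IN IP4 224.2.17.12/127
t=0 0
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 31
c=IN IP4 224.2.17.13/127
"""

print(multicast_groups(SAMPLE))
# [('audio', '224.2.17.12', 49170), ('video', '224.2.17.13', 51372)]
```

Each returned group would then be joined at the IP layer; no SIP
stack is involved at any point.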

3.3 Scalability

The scalability of conferences of this type can be excellent,
especially for audio conferences. However, it is scalable under the
assumption that multicast itself can scale to very large groups.
Indeed, in local networks, protocols like DVMRP [7] and PIM-DM have
tremendous scalability for conferences with very large numbers of
members (the so called dense modes). Given the existence of scalable
multicast, the primary bottleneck to scalability of this conference
type is the periodicity of RTCP reporting. Work has been done on
improving the problematic cases [8] so that conferences with well
over a million members are possible.
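
To see why RTCP periodicity becomes the bottleneck, consider a
simplified version of the RTCP interval rule from the RTP
specification: control traffic is held to a small fraction of the
session bandwidth, shared by all members, with a minimum interval.
The sketch below ignores the sender/receiver bandwidth split and the
randomization that RTP applies, and the 100-byte average packet size
is an assumed figure.

```python
def rtcp_interval_seconds(members, session_bw_bps, avg_rtcp_bytes=100,
                          rtcp_fraction=0.05, minimum=5.0):
    """Approximate time between RTCP reports from one member.

    Simplified: all members share rtcp_fraction of the session
    bandwidth equally, and the interval never drops below minimum.
    """
    rtcp_bytes_per_sec = session_bw_bps * rtcp_fraction / 8
    return max(minimum, members * avg_rtcp_bytes / rtcp_bytes_per_sec)


# A 64 kbit/s audio session.
print(rtcp_interval_seconds(10, 64000))         # 5.0 (the minimum applies)
print(rtcp_interval_seconds(1_000_000, 64000))  # roughly 250000 seconds
```

With a million members, each one reports only every few days under
this simple rule, which is the problematic case that [8] addresses.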

Scaling is a bit harder for video conferences. Unlike voice, where 
silence suppression allows for no data to be sent during periods of 
inactivity, the same is not the case for video. This makes it hard to 
scale without flooding users with lots of video packets. 

Security is also hard for multicast conferences. Group key 
management, especially when users leave the group, is very complex. 

Unfortunately, multicast has not been widely deployed across
backbones (some, like Internet2, do support it, but they are the
exception rather than the rule). The MBone has collapsed, for all
intents and purposes. Very few ISPs support multicast. As a result,
wide area conferences are not really viable using multicast. However,
these conferences are very suitable for LAN or enterprise
conferences, where multicast is often deployed.

3.4 Location of Service Logic 

This conferencing model does not require any SIP extensions. It does 
require that SIP UAs are prepared to receive SIP invitations with 
multicast addresses in the SDP. These UAs need to be prepared to 
mirror the SDP in the response. They should also be prepared to never 
receive a BYE for the conference. 

3.5 Discovering Participant Identities 

The identity of the participants in the session is learned entirely
through RTCP. Each user in a group multicasts RTCP packets with their
name, email address, and so on. Note, however, that in large
conferences, there may be significant amounts of time between a
participant joining, and sending of their first RTCP SDES packet
(this is for receivers only; senders will become known much faster).

4 Dial-In Conference Servers



Dial-in conference servers closely mirror dial-in conference bridges
in the traditional PSTN.

A dial-in conference server acts as a normal SIP UA. Users call it,
and the server maintains point-to-point SIP relationships with each
user that calls in. The server takes the media from the users who
dial into the same conference, mixes them, and sends out the
appropriate mixed stream to each participant separately.

Figure 2: Dial-In Conference Servers

The model is depicted in Figure 2. Note that each UA (A, B, C, D) has
a point-to-point SIP and RTP relationship with the conference server.
Each call has a different Call-ID. Each user sends their own media to
the server. The media delivered to user A by the server is the media
mixed from users B, C and D. The media delivered to user B by the
server is the media mixed from users A, C and D. The media delivered
to user C by the server is the media mixed from users A, B and D. The
media delivered to user D is the media mixed from users A, B and C.
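
The per-participant mixes described above can be sketched as
follows. This is a toy illustration, assuming every participant
contributes one frame of already-decoded 16-bit linear PCM samples of
equal length; a real server would also handle decoding, timing and
re-encoding. The function name is hypothetical.

```python
def mix_for_each(frames):
    """Return, for each participant, the sum of everyone else's samples.

    frames maps participant name -> list of 16-bit PCM samples.
    """
    length = len(next(iter(frames.values())))
    # Sum all inputs once, then subtract each participant's own input.
    total = [sum(frame[i] for frame in frames.values()) for i in range(length)]

    def clip(sample):
        return max(-32768, min(32767, sample))

    return {who: [clip(total[i] - own[i]) for i in range(length)]
            for who, own in frames.items()}


frames = {"A": [100, 0], "B": [10, 20], "C": [1, 2], "D": [1000, -1000]}
mixes = mix_for_each(frames)
print(mixes["A"])  # [1011, -978]  (B + C + D)
print(mixes["D"])  # [111, 22]     (A + B + C)
```

Computing the total once and subtracting each input keeps the mixing
cost linear in the number of participants, although each of the N
output streams still has to be encoded separately.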

The conference is identified by the request URI of the calls from
each participant. This provides numerous advantages from a services
and routing point of view [9]. For example, one conference on the
server might be known as sip:conference34@servers.com. All users who
call sip:conference34@servers.com are mixed together.

Dial-in conference servers are usually associated with pre-arranged
conferences. However, the same model applies to ad-hoc conferences. 
An ad-hoc conference server creates the conference state when the 
first user joins, and destroys it when the last one leaves. The SIP 
and RTP interfaces are identical to the pre-arranged case. 

Since conferencing servers are nothing more than SIP UASes, they can 
use any of the procedures SIP allows a UAS to use. This includes 
authentication. So, for example, a specific conference may have a 
password associated with it. Users who join are challenged (with a 
401) using digest authentication. The realm, in this case, would 
identify the conference. The INVITE that comes back would have an 
Authorization header that includes the response to the challenge - 
the name of the user trying to join the conference, and the 
conference password, hashed as defined in [10] . 
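
For readers unfamiliar with digest authentication, the response value
can be sketched as below. This assumes [10] refers to HTTP Digest
authentication and shows only its basic MD5 form without the qop
extension; the user name, realm, password and nonce are made-up
values for illustration.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a string, as used by basic Digest authentication."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(user, realm, password, method, uri, nonce):
    """Digest response without qop: MD5(HA1:nonce:HA2)."""
    ha1 = md5_hex(f"{user}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")


# Hypothetical values: the realm names the conference, as suggested above.
response = digest_response(
    user="B",
    realm="conference34",
    password="secret",
    method="INVITE",
    uri="sip:conference34@servers.com",
    nonce="84a4cc6f3082121f32b42a2187831a9e",
)
print(len(response))  # 32
```

The server repeats the same computation with its stored password and
admits the user only if the two values match.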

Conferences can also limit the number of participants. When a new
user tries to join, but the conference is full, the conference server
can just reject the request with a "500 Conference Full" response.

4.1 Inviting Users to Join



Inviting users to join is done using the SIP REFER message [11]. If
user A wishes to ask user B to join, A would send B a REFER that
looks like:

   REFER sip:B@example.com SIP/2.0
   From: sip:A@example.com
   To: sip:B@example.com
   Refer-To: sip:conference34@servers.com

This would cause B to send an INVITE message to the conference
server:

   INVITE sip:conference34@servers.com
   From: sip:B@example.com
   To: sip:conference34@servers.com
   Referred-By: sip:A@example.com

Since the request URI identifies the conference, this will cause B to
get added to conference 34.
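
The mapping between the two messages above can be written down
directly. The sketch below represents each message as a plain
dictionary of the headers shown in the examples; it is not a SIP
stack, and the function name is hypothetical.

```python
def invite_from_refer(refer):
    """Build the INVITE fields a UA would send after accepting a REFER.

    refer is a dict with the "From", "To" and "Refer-To" header values.
    """
    return {
        "method": "INVITE",
        "request_uri": refer["Refer-To"],  # names the conference
        "From": refer["To"],               # the referred party places the call
        "To": refer["Refer-To"],
        "Referred-By": refer["From"],      # who asked for the call
    }


refer = {
    "From": "sip:A@example.com",
    "To": "sip:B@example.com",
    "Refer-To": "sip:conference34@servers.com",
}
invite = invite_from_refer(refer)
print(invite["request_uri"])   # sip:conference34@servers.com
print(invite["Referred-By"])   # sip:A@example.com
```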



4.2 Users Joining 

Users joining is easily done. The participant that wishes to join
simply sends an INVITE to the conference server, with the conference
ID in the request URI. The conference ID (which is a SIP URL) can be
learned by any number of means, including having it on a web page,
receiving it in an email, etc.

For example, if B wishes to join sip:conference34@servers.com, B
would send the following request:

   INVITE sip:conference34@servers.com
   From: sip:B@example.com
   To: sip:conference34@servers.com



4.3 Scalability 

The scalability of this model is limited by the bandwidth and
processing power of the conference server. If there are N
participants in a conference, M of which are sending media streams,
the server will need to manage N signaling relationships, perform M
RTP stream decodes, and N RTP stream encodes (assuming M > 0). The
encoding is the primary processing bottleneck, and the sending of the
N media streams is the primary bandwidth bottleneck. However,
conference servers can be built using heavy duty hardware, and have
high bandwidth access.

Furthermore, since we are using the request URI to name the
conferences, we can use standard SIP techniques for distributing
conferences across servers [9].

4.4 Location of Service Logic 

The SIP UA of the conference participants does not require any
special processing. The RTP implementation in those clients, however,
should support RTCP and be prepared to receive contributing sources.

All of the new logic for providing this service resides in the 
conferencing server. No SIP extensions are needed, simply logic that 
resides above the SIP stack to manage the conferencing service. 

4.5 Discovering Participant Identities 

The identities of other participants in the conference are NOT known
through SIP. Rather, they are learned through RTP. The conference
server is an RTP mixer. As such, it takes the RTCP SDES of the streams
it mixes, and aggregates them into the RTCP stream sent out. This will
allow participants to gradually (over a few seconds) learn the
identities of the other participants.

5 Ad-hoc Centralized Conferences



In an ad-hoc centralized conference, two users A and B start with a
normal SIP call. At some point later, they decide to add a third 
party. Instead of using end system mixing, they would prefer to use a 
conference server, as defined in Section 4. 

This model corresponds roughly to the centralized multipoint 
conference model of H.323. 

One of the participants takes responsibility for transitioning to a 
conference server. The first step in this process is the discovery of 
a conference server that supports ad-hoc conferences. This can be 
done through static configuration, or through any of a number of 
standard service discovery protocols, such as the Service Location 
Protocol [12] . 

Once the server is discovered, a conference ID is chosen. This ID
must be globally unique. The conference ID is then prepended to the
server's host name, and a SIP URL for the ad-hoc conference is formed.
For example, if the server "a.servers.com" is used, and the unique ID
is "a7hytaskp09878a", the SIP URL for this conference is
sip:a7hytaskp09878a@a.servers.com.
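
One way to form such a URL is sketched below. The text does not say
how the unique ID is generated; using a random UUID is an assumption
made for this example, chosen because collisions are vanishingly
unlikely without any coordination between users.

```python
import uuid


def new_conference_url(server_host):
    """Form a SIP URL for an ad-hoc conference on the given server.

    The ID is a random UUID (an assumption; any globally unique
    string would do).
    """
    conference_id = uuid.uuid4().hex
    return f"sip:{conference_id}@{server_host}"


url = new_conference_url("a.servers.com")
print(url.endswith("@a.servers.com"))  # True
```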

The user who is performing the transition (say, user A) then sends an
INVITE to this URL. This creates the initial conference state in the
server. A then sends a REFER to the other party in the call (say, B),
referring them to sip:a7hytaskp09878a@a.servers.com. B sends an
INVITE to this address, and is added to the conference. Once the 200
OK response to the REFER is sent from B to A, A hangs up the original
call with B. A and B are now in a conference using a conference
server. From here, operation is identical to the system described in
Section 4.

It is also possible to transition from an end system mixed conference
(even one with a complex connection topology) to a centralized
conference server. One user takes responsibility for initiating the
transition, which proceeds as described above. However, the REFER
request is sent to all SIP peers adjacent to the user. In addition,
when a SIP UA receives a REFER, it must not only act on it as
described above, but also generate a REFER to any of its other
adjacent SIP peers. In essence, the REFER message is propagated along
the connection graph, starting at the root (which is the user who
initiates the transition). The transition will work so long as the
graph has no cycles (which is needed anyway, as discussed above), and
so long as only one user attempts to initiate the transition. If
multiple users attempt to initiate the transition at the same time,
the conference will break into two disjoint ad-hoc conferences, with
membership depending on the temporal dynamics of the REFER
propagation.
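
The propagation can be sketched as a traversal of the connection
graph. The sketch below assumes the graph is given as an adjacency
list and that each UA forwards the REFER to every adjacent peer
except the one it came from; the function name is hypothetical. Like
the procedure it illustrates, it only terminates when the graph has
no cycles.

```python
from collections import deque


def propagate_refer(adjacency, initiator):
    """Return the (sender, receiver) REFERs sent, in breadth-first order.

    adjacency maps each user to the list of users it has a call with.
    The graph must be acyclic.
    """
    sent = []
    queue = deque([(initiator, None)])
    while queue:
        user, came_from = queue.popleft()
        for peer in adjacency[user]:
            if peer != came_from:
                sent.append((user, peer))
                queue.append((peer, user))
    return sent


# The end system mixed conference from Section 2.1: B - A - C - D.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
print(propagate_refer(graph, "A"))
# [('A', 'B'), ('A', 'C'), ('C', 'D')]
```

Every user other than the initiator receives exactly one REFER, so
all of them end up calling the same conference URL.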

5.1 Inviting Users to Join 




Once the ad-hoc conference has been created on the server, inviting 
users proceeds as defined in Section 4.1. 



5.2 Users Joining

Once the ad-hoc conference has been created on the server, joining
proceeds as defined in Section 4.2.



5.3 Scalability 



The scalability of this conference model is identical to that of
dial-in conference servers, as described in Section 4.3.

5.4 Location of Service Logic

The logic for handling the transition process must be located in at
least one UA in the conference. All UAs that are mixers in an end
system mixed conference must know to propagate the REFER requests
they receive during the transition.



5.5 Discovering Participant Identities 






Once the ad-hoc conference is established, conference identities are
determined through RTCP, as in the dial-in case.

6 Dial-Out Conferences

Dial-out conferences are a simple variation on dial-in conferences.
Instead of the users joining the conference by sending an INVITE to
the server, the server chooses the users who are to be members of the
conference, and then sends them the INVITE. Typically dial-out
conferences are pre-arranged, with specific start times and an
initial group membership list.

Once the users accept or reject the call from the dial-out server,
the behavior of this system is identical to the dial-in server case
of Section 4. Thus, a dial-out conference server will generally need
to support dial-in access for the same conference, if it wishes to
allow joining after the conference begins.

Note that, from the participants' perspective, they will learn the
conference identity (the URL) from the From field in the INVITE 
messages received from the server. 

6.1 Inviting Users to Join 

Once the conference is established, inviting users to join is 
identical to the scenario described in Section 4.1. Note that the URL 
to be used in the REFER is obtained from the From field of the INVITE 
received from the dial-out server.

6.2 Users Joining 

Once the conference is established, joining is identical to the
scenario described in Section 4.2. Note that the URL to be used in
the INVITE of new participants is obtained from the From field of the
INVITE received from the dial-out server by the initial participants.

6.3 Scalability

The scalability of this conference model is identical to that of
dial-in conference servers, as described in Section 4.3.

6.4 Location of Service Logic 

The SIP UA of the conference participants does not require any 
special processing. The RTP implementation in those clients, however, 
should support RTCP and be prepared to receive contributing sources. 

All of the new logic for providing this service resides in the 
conferencing server. No SIP extensions are needed, simply logic that 
resides above the SIP stack to manage the conferencing service. 

6.5 Discovering Participant Identities 

Once the conference is established, conference identities are
determined through RTCP, as in the dial-in case.

In this conferencing model, there is a centralized controller, as in
the dial-in and dial-out cases. However, the centralized server
handles signaling only. The media is still sent directly between
participants, using either multicast or multi-unicast. Multi-unicast
is when a user sends multiple packets (one for each recipient,
addressed to that recipient). This is referred to as a "Decentralized
Multipoint Conference" in H.323. Interestingly, this conference model
is possible with baseline SIP.

It works through third party call control [13]. The conference server
sends a re-INVITE to each participant when a new one joins. The
re-INVITEs add a media stream that gets sent to the new participant
(and similarly in the reverse direction).

Let us assume for the moment that a conference already exists with
three participants. In this state, each participant is sending media
directly to each of the others. This is because the SDP that the
conference server has given to each participant contains three media
lines, each of type audio, with connection addresses and ports
corresponding to each of the three users.

The call flow from here is shown in Figure 3. A new participant
joins the conference. It does so by sending an INVITE (1) to the
server, with the conference ID in the request URI. The SDP in the
INVITE contains a single media stream, with an IP address and port
where it would like to receive media (D). The 200 response from the
conference server (2) contains a single media line with an IP address
of 0.0.0.0 and a random port, indicating hold.

The next step is for the server to obtain two more addresses where
the new participant will be receiving media (it already has one from
the original INVITE). To do this, it sends a re-INVITE to the new
participant (4). This re-INVITE contains two additional media streams
(for three total), all three of which are on hold. The 200 response
to the re-INVITE (5) contains two additional IP addresses and ports
where the user is willing to receive media.

Now the server needs to inform the other parties that they should
begin sending media to the new user. It first sends a re-INVITE to
user C (7). This re-INVITE adds an additional media stream to the two
that C has already been sending. This new media stream uses one of
the three connection addresses and ports returned by D in message
(5). Call this address/port D1. The other two are D2 and D3. The 200
OK response from user C (8) contains the address and port where C is
willing to receive a new, third media stream. Call this port C3. The
server holds on to this port, as it will use it later on, sending it
to D, so that D sends media there. At this point, however, C can
begin sending media to D.

This re-INVITE process happens for B and for A as well. In the
re-INVITE to B (10), the server adds an additional media line (above
the two already in use by B) using address/port D2. The response (11)
contains a new address/port to send media to B. Call this port B3. In
the re-INVITE to A (13), the server adds an additional media line
using address/port D3. The response (14) contains a new address/port
to send media to A. Call this port A3.

Finally, the server sends a re-INVITE (15) to the new party. This
re-INVITE takes all three streams off hold, and updates their
connection addresses and ports with C3, B3, and A3, respectively. The
200 OK response (16) returns the same ports and addresses returned in
message (5) (as noted in [13], these addresses/ports MUST NOT
change). Now, D can send media to A, B and C.

The result of these manipulations is, indeed, a full mesh of unicast
RTP streams between all participants. Unlike the case of end system
mixing, the stream sent by any participant to all of the others is
identical. Each participant needs to mix, but it mixes the media it
receives, and plays the result out through the speakers. This is
normal behavior for multiple streams of the same type. Note that the
SIP relationship is still point-to-point. There are four calls at the
end of Figure 3, one from each participant to the server, each with a
different Call-ID.

Note that hybrids are easily possible. Certain users can instead be 
mixed (sending audio to the conference server) , while others are set 
to send audio to each other. 

7.1 Inviting Users to Join 

Inviting users to join works identically to the dial-in conference
bridge scenario of Section 4.1.

7.2 Users Joining

A user joins in the same way described in Section 4.2.

7.3 Scalability



The scalability of this conferencing model depends on many factors.
From a media perspective, the conference server never even touches a
single media stream. However, for N participants, each participant
needs to be able to receive, decode, and mix N-1 media streams. For
users accessing the server through dial-in modems, this will severely
limit the sizes of these conferences. However, the processing burden
is much less than that of the end system mixing model. This is
because each end user needs to decode N-1 streams, but only encode 1.
Decoding is much, much cheaper than encoding, so supporting many
decodes is not necessarily a problem. This is especially the case
when silence suppression is in use. In that case, streams are only
sent by talking users. This means any given user only needs to decode
(and receive) as many streams at a time as there are users talking.
This can vastly improve scalability of the conference.

There is a signaling burden on the server, however. If there are N 
users in the conference, addition of a new user (the (N+1)th) requires
N+3 INVITE transactions, each of which has three messages. Similarly, 
departure of a user requires N BYE transactions, each of which has 2 
messages. For large N, and highly dynamic conferences, this can 
represent a potential burden. However, we believe this bottleneck is 
much farther out than the processing and bandwidth bottlenecks at the 
end users. 
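
The signaling counts given above are easy to tabulate. The sketch
below simply encodes the figures from the text (N+3 INVITE
transactions of three messages each for a join, N BYE transactions of
two messages each for a departure); the function names are
illustrative.

```python
def join_cost(n):
    """(transactions, messages) at the server when a user joins N others."""
    transactions = n + 3
    return transactions, transactions * 3   # INVITE, 200 OK, ACK


def leave_cost(n):
    """(transactions, messages) at the server when a user departs."""
    return n, n * 2                          # BYE, 200 OK


# The example of Figure 3: three existing participants, D joins.
print(join_cost(3))    # (6, 18)
print(leave_cost(3))   # (3, 6)
print(join_cost(100))  # (103, 309)
```

The cost of each join grows linearly with the conference size, so
building a conference of N users one at a time costs on the order of
N squared messages in total.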

For these reasons, we believe this conference model is ideal in 
corporate enterprises, where bandwidth is more plentiful and PCs are 
generally faster. 

7.4 Location of Service Logic 

Nearly all of the logic for implementing this conferencing service 
lives in the server itself. 

The only requirement on the end users is that they support
multiple, parallel media streams of the same type, and that they be
prepared to mix those streams together. They must also support the
third party control primitives [13], which don't require anything
beyond baseline SIP, but are unlikely to be supported unless
implementers take explicit steps to do so.

It is this combination (no need for media processing in the server
and no need for specialized SIP processing in the end systems) that
makes this model attractive.

7.5 Discovering Participant Identities 

The identities of conference participants are discovered through
RTCP. Each user will receive N-1 RTP streams, each of which has its
own RTCP channel that carries the participant identification.

Table 1 shows a summary of the differences between the various
models.

Table 1: Summary of Models

9 What's Missing - Full Mesh

The sections above cover a wide range of conferencing models, but not
all of them. One model, in particular, is not supported by SIP. That
model is the fully distributed multiparty model.

In this conferencing model, each user has a point to point SIP
relationship with every other user. Each user also has a point to
point RTP relationship with every other user, as is done in the
decentralized conference of Section 7.

Two earlier drafts were written on the subject, but they specified
protocols that were overly complex and still had race conditions and
unhandled cases. The primary difficulty is that the model requires
every participant to learn the identity of every other participant. As
participants come and go, this requires some kind of state flooding
mechanism that causes this information to propagate, and eventually
converge, across participants. While these kinds of distribution
mechanisms have been built for multiparty conferences [14], fitting
such a distribution mechanism into SIP is not trivial, especially
with the complex requirements that were initially targeted.

Furthermore, the distributed nature of the signaling makes
enforcement of any kind of conference policy nearly impossible.
Failures can also result in unusual conditions. Specifically, it is
fairly easy for the conference mesh to break in certain places,
resulting in a graph where every user hears most of the other users,
but not all. This can happen, for example, if user A is invited into
a conference, but is rejected by one of the users already in the
conference. Because the SIP relationships are point-to-point, a new
user needs to establish a SIP call with all existing participants,
and any one of those calls may fail. With large conferences, this
becomes a very real possibility. Earlier work tried to avoid such
conditions.
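
To make the broken mesh condition concrete, the following Python
sketch lists the pairs of participants that have no call between
them. It is only an illustration: the function name and the
representation of calls as pairs of user names are assumptions, not
anything defined by SIP or by this draft.

```python
from itertools import combinations


def missing_links(participants, calls):
    """Return the participant pairs that have no call between them.

    participants: iterable of user names
    calls: iterable of (user, user) pairs with an established call
    A full mesh of N users needs N*(N-1)/2 calls; any pair returned
    here is a place where the mesh is broken.
    """
    established = {frozenset(call) for call in calls}
    return [pair for pair in combinations(sorted(participants), 2)
            if frozenset(pair) not in established]


if __name__ == "__main__":
    users = ["A", "B", "C", "D"]
    # User A was rejected by C, so the A-C call was never set up.
    calls = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
    print(missing_links(users, calls))
```

In this example A and C each hear two of the three other users, but
not each other, which is exactly the partial connectivity described
above.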

We believe a solution can be found by simplifying the requirements.
For example, we will abandon the requirement to add a user to
the conference only if all other users agree to add them. We will also
try to achieve gradual convergence in shared state, rather than the
rapid convergence proposed in previous work. We will not worry about
message efficiency or message frequency. The primary design objective
should be KISS.

As a baseline model, we believe that each INVITE, 200 OK response,
and ACK simply contain a header called Members. This header is a list
of URLs, and for each URL, there is a parameter that indicates
whether they are in the conference right now, and when they joined,
or whether they were previously in the conference, and when they
left. A UA simply performs a re-INVITE as it receives new
information. A periodic re-INVITE (a la session timer [15]) will also
be needed to heal partitions and deal with other conditions that may
arise.
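
One way to picture the gradual convergence of Members state is the
Python sketch below. It is a sketch under stated assumptions, not a
specification: the draft does not define the syntax of the Members
header or a merge rule, so the MemberState type, the merge_members
function, and the "most recent timestamp wins" rule are all
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    present: bool  # True: in the conference now; False: has left
    since: float   # time of the join (if present) or of the departure


def merge_members(local, received):
    """Merge a received Members view into the local one.

    Both arguments map a URL to a MemberState. For each URL the entry
    with the later timestamp is kept (an assumed rule). Returns
    (merged, changed); changed is True when the UA learned something
    new and would therefore send a re-INVITE to its peers.
    """
    merged = dict(local)
    changed = False
    for url, state in received.items():
        known = merged.get(url)
        if known is None or state.since > known.since:
            merged[url] = state
            changed = True
    return merged, changed


if __name__ == "__main__":
    local = {"sip:a@example.com": MemberState(True, 10.0)}
    received = {
        "sip:a@example.com": MemberState(False, 20.0),  # a has left
        "sip:b@example.com": MemberState(True, 15.0),   # b has joined
    }
    merged, changed = merge_members(local, received)
    print(changed, merged)
```

Because merging the same information twice changes nothing, repeated
or periodic re-INVITEs are harmless under this rule, which fits the
stated goal of favoring simplicity over message efficiency.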

More work is needed to validate the model and to see what other 
capabilities are needed. 

10 Security Considerations 

The use of a server that performs the mixing on behalf of other
users, which is the case for all but one of the conference models
described here, introduces security risks. That entity must be
trusted by the others to properly mix the media - not omitting a
stream, for example. As such, it is recommended that participants in
a conference authenticate the identity of the server. In the dial-in,
dial-out, and decentralized conferences, this will require
authentication of responses by participants.

Mixing also eliminates the privacy possible with end-to-end media
transport with mixing in the receivers. Such privacy is still
possible in the large-scale multicast conferences, but requires
shared keying material for the conference. Doing this for highly
dynamic groups is still an open research problem.

11 Conclusion

In this draft, we have shown how to use baseline SIP (assuming
endpoints that support the mixing and/or third party call control
feature sets) to construct several multiparty conferencing models.
These include end system mixing, large-scale multicast conferences,
dial-in conference servers, dial-out conferences, ad-hoc centralized
conferences, and centralized signaling, distributed media
conferences.

We note that this covers all of the multipoint conferencing models
described in H.323v1 [16]. Further work is needed to see how (and if)
to support the hierarchical conference bridges defined in H.323v2
[17].